{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.3 Uses of the standard Normal distribution\n", "\n", "Suppose we wanted to answer the following question: \n", "\n", "> What is the probability of having a 'healthy' weight?\n", "\n", "A healthy weight is often is often measured using the Body Mass Index (BMI - although see [here](https://www.health.harvard.edu/blog/how-useful-is-the-body-mass-index-bmi-201603309339) and [here](https://www.bbc.co.uk/news/health-43895508) for a discussion on why this may be too simplistic a measure). An individual's BMI can be calculated using their height and weight, using the formula BMI $= \\frac{mass(kg)}{height(m)^2}$. Then people can be classified as:\n", "\n", "| Classification | BMI | \n", "|:-|:-|\n", "| Underweight |<18.5 |\n", "| Normal | 18.5-24.9 |\n", "| Overweight | 25-29.9 |\n", "| Obese | 30-39.9 |\n", "| Extremely obese | >40 |\n", "\n", "To address our question, we will use data taken from a study undertaken among a group of 76 cleaners, that investigated whether telling the cleaners they had an active lifestyle influenced their BMI. We will assume that values of BMI approximately follow a normal distribution. We do not know the true values of $\\mu$ and $\\sigma$ so we will replace these with the sample mean and standard deviation. This gives us values of $\\mu=$26.5 and $\\sigma^2=$ 18.1, as demonstrated in the snippet of code below.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
CondAgeWtWt2BMIBMI2FatFat2WHRWHR2SystSyst2DiastDiast2
0 43 137 137.425.1 25.1 31.9 32.8 0.79 0.79 124 118 70 73
0 42 150 147.029.3 28.7 35.5 NA 0.81 0.81 119 112 80 68
0 41 124 124.826.9 27.0 35.1 NA 0.84 0.84 108 107 59 65
0 40 173 171.432.8 32.4 41.9 42.4 1.00 1.00 116 126 71 79
0 33 163 160.237.9 37.2 41.7 NA 0.86 0.84 113 114 73 78
0 24 90 91.816.5 16.8 NA NA 0.73 0.73 NA NA 78 76
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllllllll}\n", " Cond & Age & Wt & Wt2 & BMI & BMI2 & Fat & Fat2 & WHR & WHR2 & Syst & Syst2 & Diast & Diast2\\\\\n", "\\hline\n", "\t 0 & 43 & 137 & 137.4 & 25.1 & 25.1 & 31.9 & 32.8 & 0.79 & 0.79 & 124 & 118 & 70 & 73 \\\\\n", "\t 0 & 42 & 150 & 147.0 & 29.3 & 28.7 & 35.5 & NA & 0.81 & 0.81 & 119 & 112 & 80 & 68 \\\\\n", "\t 0 & 41 & 124 & 124.8 & 26.9 & 27.0 & 35.1 & NA & 0.84 & 0.84 & 108 & 107 & 59 & 65 \\\\\n", "\t 0 & 40 & 173 & 171.4 & 32.8 & 32.4 & 41.9 & 42.4 & 1.00 & 1.00 & 116 & 126 & 71 & 79 \\\\\n", "\t 0 & 33 & 163 & 160.2 & 37.9 & 37.2 & 41.7 & NA & 0.86 & 0.84 & 113 & 114 & 73 & 78 \\\\\n", "\t 0 & 24 & 90 & 91.8 & 16.5 & 16.8 & NA & NA & 0.73 & 0.73 & NA & NA & 78 & 76 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| Cond | Age | Wt | Wt2 | BMI | BMI2 | Fat | Fat2 | WHR | WHR2 | Syst | Syst2 | Diast | Diast2 |\n", "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "| 0 | 43 | 137 | 137.4 | 25.1 | 25.1 | 31.9 | 32.8 | 0.79 | 0.79 | 124 | 118 | 70 | 73 |\n", "| 0 | 42 | 150 | 147.0 | 29.3 | 28.7 | 35.5 | NA | 0.81 | 0.81 | 119 | 112 | 80 | 68 |\n", "| 0 | 41 | 124 | 124.8 | 26.9 | 27.0 | 35.1 | NA | 0.84 | 0.84 | 108 | 107 | 59 | 65 |\n", "| 0 | 40 | 173 | 171.4 | 32.8 | 32.4 | 41.9 | 42.4 | 1.00 | 1.00 | 116 | 126 | 71 | 79 |\n", "| 0 | 33 | 163 | 160.2 | 37.9 | 37.2 | 41.7 | NA | 0.86 | 0.84 | 113 | 114 | 73 | 78 |\n", "| 0 | 24 | 90 | 91.8 | 16.5 | 16.8 | NA | NA | 0.73 | 0.73 | NA | NA | 78 | 76 |\n", "\n" ], "text/plain": [ " Cond Age Wt Wt2 BMI BMI2 Fat Fat2 WHR WHR2 Syst Syst2 Diast Diast2\n", "1 0 43 137 137.4 25.1 25.1 31.9 32.8 0.79 0.79 124 118 70 73 \n", "2 0 42 150 147.0 29.3 28.7 35.5 NA 0.81 0.81 119 112 80 68 \n", "3 0 41 124 124.8 26.9 27.0 35.1 NA 0.84 0.84 108 107 59 65 \n", "4 0 40 173 171.4 32.8 32.4 41.9 42.4 1.00 1.00 116 126 71 79 \n", "5 0 33 163 160.2 37.9 37.2 41.7 NA 0.86 0.84 113 114 73 78 \n", "6 0 24 90 91.8 16.5 16.8 NA NA 0.73 0.73 NA NA 78 76 " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[1] \"value of mu is 26.46\"\n", "[1] \"value of sigma is 4.25\"\n" ] } ], "source": [ "# BMI dataset\n", "dat <- read.csv(\"Practicals/Datasets/BMI/MindsetMatters.csv\")\n", "head(dat)\n", "#remove observations with no BMI data\n", "dat <- dat[!is.na(dat$BMI),]\n", "#estimate mu and sigma\n", "mu <- mean(dat$BMI)\n", "print(paste0(\"value of mu is \",round(mu,2)))\n", "sig <- sd(dat$BMI)\n", "print(paste0(\"value of sigma is \",round(sig,2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So what is the probability a randomly selected person in this sample has a normal BMI? \n", "\n", "#### Approach 1: Manual calculation\n", "One option is to make use of pre-calculated probabilities of the standard normal distribution. If we write $X$ to represent the value of a person's BMI, then we are assuming that\n", "\n", "$$\n", "X \\sim N(\\mu=26.5, \\sigma^2=18.1)\n", "$$\n", "\n", "To make use of the pre-calculated probabilities for the standard normal distribution, we must first transform our normally distributed variable to have a standard normal distribution. We know that the transformed variable $Z$ (the *Z score*) has a standard normal distribution, where\n", "\n", "$$\n", "Z = \\frac{X - \\mu}{\\sigma}\n", "$$\n", "\n", "Given values for $\\mu$ and $\\sigma$ we can go from the *X scale* to the *Z scale* and *vice versa*. The important point about describing a distribution on the Z scale is that this opens the ability to calculate specific probabilities. So back to answering the question...\n", "\n", "From the table above we can see that a normal weight is classified as a BMI between 18.5 and 24.9, and we want to know what the probability is that a randomly selected person falls between these limits. We write this as;\n", "\n", "$$\n", "P(18.5 < X < 24.9) \n", "$$\n", "\n", "On the Z-scale, this is equivalent to saying that:\n", "\n", "$$\n", "P(-1.87 < Z < -0.37) = P(Z < -0.37) - P(Z < -1.87)\n", "$$\n", "\n", "Tables exist containing a range of pre-calculated probabilities that a variable following a standard normal distribution takes a value of less than $k$, for a range of possible values of $k$. These are often called *z-tables* (found [online](http://www.z-table.com/) or at the back of most stats books). From these tables, we can look up the corresponding probability for each z-score, giving:\n", "\n", "$$\n", "0.3557 - 0.0307 = 0.325\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Approach 2: Using R to do the same calculation\n", "\n", "Using this approach, R is ultimately using the same pre-calculated probability tables. However, it is considerably quicker and easier to ask R to look up the values rather than finding them in tables." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"z_max is -0.37 and z_min is -1.87\"\n", "[1] \"Probability of having a healthy BMI is (z-score) 0.326\"\n" ] } ], "source": [ "# a) if we were to use Z tables within R (to illustrate the point)\n", "\n", "z_min <- (18.5-mu)/sig\n", "z_max <- (24.9-mu)/sig\n", "\n", "# note when using pnorm we don't need to specify mu and sigma as the \n", "# function assumes mu=0 and sigma=1 unless specified.\n", "print(paste0(\"z_max is \",round(z_max,2),\" and z_min is \",round(z_min,2)))\n", "print(paste0(\"Probability of having a healthy BMI is (z-score) \",round(pnorm(z_max)-pnorm(z_min),3)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Approach 3: Using R to do the calculation on the untransformed scale" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Probability of having a healthy BMI is (direct) 0.326\"\n" ] } ], "source": [ "# b) if we were to directly estimate\n", " \n", "print(paste0(\"Probability of having a healthy BMI is (direct) \",round(pnorm(24.9,mu,sig)-pnorm(18.5,mu,sig),3)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating directly gives the same result as using a z-score in R, and this returns the same information as using z-tables.\n", "\n", "In answer to our question, we estimate that the probability of having a 'healthy' weight is 32.6\\%. We can compare this to the observed proportion of our sample of data with a 'healthy' BMI." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"Within the data a healthy BMI is seen 35.1%\"\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAeAAAAFoCAMAAAC46dgSAAAAPFBMVEUAAAAzMzNGgrRNTU1o\naGh8fHyMjIyampqnp6eysrK9vb3Hx8fMzMzQ0NDZ2dnh4eHp6enr6+vw8PD////IVY4uAAAA\nCXBIWXMAABJ0AAASdAHeZh94AAAJkUlEQVR4nO3d6ULiShCG4daoMI6jHLn/ez2ELYvptbqw\nU3m/H2Owy0rJM4GIS9yRmI777QGIbgA2HoCNB2DjAdh4ADYegI0HYOORAP93yvmfpPyodMmV\nGU09cZJBK1Q+evcAV+gJcGQcgPUqAa7QE+DIOADrVQJcoSfAkXEA1qsEuEJPgCPjAKxXCXCF\nngBHxgFYrxLgCj0BjowDsF6lGeCPcQJ1AK8V+HkIwKMlgP0BGODsSoATB58HYM8SwP4ADHB2\nJcCJg88DsGepCrAwNXY9AVbe1zrDESzYvfUjWDgOwHqVAAcCMMDZlQAnDj4PwJ4lgP0BGODs\nSoATB58HYM8SwP4ADHB2JcCJg88DsGcJYH8ABji7EuDEwecB2LMEsD8AA5xdCXDi4PMA7FkC\n2B+AAc6uBDhx8HkA9iwB7A/AqwaO/z4iwOsGjn4EwADnDwpw2uDzAOxZAhhggKMBODgOwHqV\nAAMMcDwAB8cBWK8SYIBv6c4Z3wC4YNB2gS+ws7cA5w7aNnD3YwPgzEHXATz1BdgK8HAA35+C\nn/qk/N8I5Hf+CEvqR1hJHvDkFkewuSN4dks4DsB6lVnAne+mcByA9SoLgXmILh50PcCj41k4\nDsB6lSXAZ93xC1kAGwH2RjgOwHqVAAMMcDwAB8cBWK8SYIABjgfg4DgA61UCDDDA8QAcHAdg\nvUqAAQY4HoCD4wCsVwkwwADHA3BwHID1KgEGGOB4AA6OA7BeJcAAAxwPwMFxANarBBhggOMB\nODgOwHqVAAMMcDwAB8cBWK8SYIABjsc6sDD8EZZHhCNYsHvrR7BwHID1KgEGGOB4AA6OA7Be\nJcAAAxwPwMFxGgBOi7crwMFxGgB+9tyYLni7AhwcB2C9SoABBhhg4TgA61UCDDDAAAvHAViv\nEmCAAQZYOA7AepUAAwwwwMJxANarBBhggAEWjgOwXiXAAAMMsHAcgPUqAQYYYICF4wCsV+kF\ndtfb3eyy3wALB20CuHOjAGwP+H3k+z6wj64Kbe0C0RsDPg4P0ePjerY53BaOA7BepRd4IQB7\nu64ReN/NnoPHZ1sAFw/aCvD+x0nW5Cn4/s9Tn+jBH07hV2jT3/wbQyTtqwC4bM5GMruXu9HZ\n1fU9938aOYL9KoEP4gi+3Vw+rAAuGLSwVBf4zX0DbBn40L0eFmwBLhi0sFT7IfrHSdbIFOCc\nQQtLHw18f/WqG20DnDBoYenmv5sEcEIlwFsFXsF3kwBOqAR4q8CXHF7/JPgCvFrg47dLERaO\nA7BeZQzY95IlwBmDFpY+BPiva/dnsgBOqPQC38+x9gBbBu5SfAFeIXBWhOMArFcJ8GaBv/cv\nzr3sF78rDHDWoIWlusCH68/cdbPvCgOcP2hhqS7wzvXf8D+8uh3AJoFvL3DwQgfAAK8RmIdo\n48CcZBkH5ssk68A5EY4DsF4lwJsFfju/w73wHGwTeH/5+sit/ix6dvHBBoF9l0fUBe7cZ//m\na/VfB88WWgT27IIXOgCOV3qB39zuu/9ayb0CbBL4/kLHF8AmgW8vdKScRAO8RuCcCMcBuHVg\nYUr/CItXxVf10eQfYfGOXjccwXHg9E/Jn1UewcJxAAZ4KQAnVAIMMMAAC8cBGOClAJxQCTDA\nAAMsHAdggJcCcEIlwAADDLBwHIABXgrACZUAAwwwwMJxAAZ4KQAnVAIMMMAAC8cBGOClFAB/\nfLjlqx2mAXt+DRDgyDiPA35+dgmOSQsZg0ZKAfYE4IRKgAEGGGDhOAADvBSAEyoBBhhggIXj\nAAzwUgBOqMwBHl8UuptcIVo4DsBNAE8u6z69HppwHIABXgrACZUZwGPY2QUNheMA3B7w/Sn4\nqU/Sh/rzuL+yUxFY+Dn7JqzXdp6ke3nyCM1JVuKgkdKWjuBu+YZwHIBbAe48t4TjANwIcDfd\nAjhx0EhpM8DddHN0UzgOwE0Ad7dT5+44fVULYBvAgQjHARjgpQCcUAkwwAADLBwHYICXAnBC\nJcAAAwywcJyfwB9J8d75vqqawN5kY/jbFt+jo6VGgWV3fmChGrC3KBujuFWg57AEsHwfAEfG\nAbisVaDnsASwfB8AR8YBuKxVoOewBLB8HwBHxgG4rFWg57AEsHwfAEfGAbisVaDnsASwfB8A\nR8YBuKxVoOewBLB8HwBHxgG4rFWg57AEsHwfAEfGAbisVaDnsASwfB8AR8YBuKxVoOewBLB8\nHwBHxgG4rFWg57AEsHwfAEfGAbisVaDnsFQFWJifu14zcPanX7FVKBzBxQv2j+DITmLjAFzW\nKtBzWAJYvg+AI+MAXNYq0HNYAli+D4Aj4wBc1irQc1gCWL6PLQKHfktuO8BpvyRZ0CrxntYE\n9g26LWBR20CrxHsa4Dr7ABjgsv0BnL8AMMAAAwwwwAADXN4KYO8CwAADDDDAAAMMcHkrgL0L\nAAMMMMAAAwwwwOWtAPYuAAwwwAADDDDAAJe3Ati7ALA/42sGx64fDPD6gMeXdY9e4h1ggAEu\nbwWwdwHgTOCnPj+LE3/t6hKXVS3LI/f1Cwn41T2CQ0k4gn2VGU09cZJBK1Q+evcAV+gJcGQc\ngPUqAa7QE+DIOADrVWYA31+96kbbAGdWtgzsj3AcgPUqAa7QE+DIOADrVQJcoSfAkXEA1qsE\nuEJPgCPjAKxXWQe4z8K3lMTR6LnZQQFurCfADTVdw6AAN9azNWDSeAA2HoCNB2DjAdh4ADYe\nCfDlhzu66Y95iHL/6ZF6LadNNzeoBPg6RL1P8P4zX7Of/arTdJODSoC7I8DND1rhIbrqNNeG\n9Rt3mx1UDlz1aeioeL9tc9A6R3DlR7/q91v9pheDyoNqNJWfRU+3xFEDnm7UaapyBE/eitMa\nsNqjwnyrStctAFd+5Js1brSp6ll0Yw/RVc/ru1nj2k2bthg3begkq+aLOd3tBLLmeaRa0/Hb\ndpvyWrTxAGw8ABsPwMYDsPEAbDwAGw/AxgOw8WwY2F2y+zxtd2/vh/M7D+9vXb/0u6NVjJ3P\nJDvuln/n7d35nTvX4wJsIVfFP+6l3365vrL+ArCV3BQvon9c/1D9eXoLsJHcj+B9v32iPd/4\nBNhKbk/Bu/P26cH59PbFHQG2khvw23n7dHp1OB5O2gBbyVXx7+Uh+vjPvR/f3V+AzeR+knX5\nwvfbvR5f3TfAZjI9iz72uidjgM3kovi975+E++1399afSQNsJfdXsr4uoodhE2ALueh2u6/j\nVbTrn40BJqsKwMYDsPEAbDwAGw/AxgOw8QBsPAAbD8DGA7DxAGw8/wOIgTQYMSd6RgAAAABJ\nRU5ErkJggg==", "text/plain": [ "plot without title" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# c) provide a sanity check against the data\n", "options(repr.plot.width=4, repr.plot.height=3)\n", "library(ggplot2)\n", "\n", "ggplot(dat,aes(x=BMI)) + geom_histogram(bins = 30,fill=\"steelblue\",col=\"grey80\") +\n", " geom_vline(xintercept = c(18.5,24.9))\n", "#hist(dat$BMI,col=\"steelblue\")\n", "#abline(v=c(18.5,24.9),lty=2)\n", "print(paste0(\"Within the data a healthy BMI is seen \",\n", " round(100*((sum(dat$BMI<24.9)-sum(dat$BMI<18.5))/length(dat$BMI)),1),\"%\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we can see that the sample estimate (35.1%) is roughly similar to the population estimate of 32.6%. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 4 }